Introduction:

This data visualization project uses the California Housing Prices dataset, which comes from the 1990 census of the State of California.

  1. California Housing Prices

The objective is to apply the tools and principles of data visualization to display and uncover trends, patterns, tendencies, and outliers. Using ggplot2 for R, this report covers the following techniques and tools:

  1. Data transformation using functions such as filter, select, group_by, and others.

  2. Bar charts, line charts, and others.

  3. Scatter plots, histograms.

  4. Dashboards.

  5. ggplot2 is the plotting library.

  6. The coding language is R.

  7. RStudio is the integrated development environment.

  8. For spatial visualization, the package is sf.

  9. Fitting a linear regression model.

The California Housing Prices dataset contains the median house prices of California from the 1990 census. Here’s a short summary of each term:

Initially, I was drawn to a dataset of houses in the West Roxbury neighborhood, but once I started analyzing it I found that it does not have the spatial data needed to create a map plot. In addition to the process of:

I want to tell the story of how prices change with distance to the ocean. This information is contained in the California Housing Prices dataset.

By taking this class, I have started to use the principles of data visualization as we are learning them:

  1. Know Your Audience
  2. Choose the Right Chart Type
  3. Simplify and Remove Clutter
  4. Use Color Purposefully
  5. Provide Context
  6. Show the Data Honestly
  7. Tell a Story

Data Summary

A data summary is a concise way to describe and understand the main features of a dataset. It helps to quickly grasp patterns, trends, and outliers without going through the raw data.

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  ocean_proximity    median_house_value
##  Length:20640       Min.   : 14999    
##  Class :character   1st Qu.:119600    
##  Mode  :character   Median :179700    
##                     Mean   :206856    
##                     3rd Qu.:264725    
##                     Max.   :500001    
## 
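A summary like the one above can be produced with base R's summary(). The sketch below is only an illustration on a small made-up data frame (the name toy and all of its values are assumptions, not the census data). It also shows how a missing value surfaces as an NA's count, as it does for total_bedrooms above.

```r
# Toy stand-in for the housing data (values are made up for illustration)
toy <- data.frame(
  median_income   = c(2.5, 3.5, 4.7, 8.3),
  total_bedrooms  = c(296, NA, 435, 647),
  ocean_proximity = c("INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"),
  stringsAsFactors = FALSE
)

# Numeric columns get min, quartiles, median, mean, and max;
# character columns only report length, class, and mode
print(summary(toy))

# The same figures can be pulled out individually
income_median    <- median(toy$median_income)
bedrooms_missing <- sum(is.na(toy$total_bedrooms))
```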

Plotting data distribution

A histogram is a graphical representation of the distribution of a numeric variable. It groups data into intervals (called bins) and shows how many values fall into each bin. Why plot a histogram of the data?

  • To understand the distribution (normal, skewed, bimodal, etc.)

  • To spot outliers or clusters

  • To summarize large datasets

  • To compare shapes of distributions across groups
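The binning step described above can be sketched in base R. This is a minimal illustration on made-up values (the vector x and the bin edges are assumptions); hist() with plot = FALSE returns the bin counts without drawing anything, and ggplot2's geom_histogram() performs an equivalent binning step before plotting.

```r
# Made-up values; the single large value plays the role of an outlier
x <- c(1, 2, 2, 3, 3, 3, 8)

# Bin edges at 0, 2, 4, 6, 8, 10; by default each bin includes its upper edge
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

# Number of values that fall into each bin
bin_counts <- h$counts
print(data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = bin_counts
))
```

The empty bin between the main cluster and the value 8 is the kind of gap that makes an outlier stand out in a histogram.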

Graph of the Distribution of Houses by Distance to Ocean.

It is likely that the ISLAND category is skewing the data because of its house values.


Correlation Table

The Pearson correlation coefficient measures the strength of the linear relationship between two variables. It ranges from -1 to +1:

  • +1 = perfect positive correlation

  • 0 = no correlation

  • -1 = perfect negative correlation
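The three reference values above can be reproduced on made-up vectors. The sketch below computes the Pearson coefficient by hand (the helper name pearson is hypothetical) and compares it with base R's cor(), whose default method is Pearson.

```r
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 6, 8, 10)   # rises exactly with x
y_down <- c(10, 8, 6, 4, 2)   # falls exactly with x
y_flat <- c(1, -1, 0, -1, 1)  # no linear trend in x

# Covariance of a and b divided by the product of their spreads
pearson <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) /
    sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
}

r_up   <- pearson(x, y_up)
r_down <- pearson(x, y_down)
r_flat <- pearson(x, y_flat)

# The hand-rolled version should agree with the built-in function
r_builtin <- cor(x, y_up)
```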

Does correlation always imply causation? No. Correlation is often the first clue, but to establish causation, we usually need:

  • Controlled experiments

  • Longitudinal studies

  • Strong theoretical support

  • Ruling out confounders

Still, correlation means two variables are related: when one changes, the other tends to change as well, in a positive or negative direction.

                     longitude    latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
longitude            1.0000000  -0.9246644          -0.1081968    0.0445680              NA   0.0997732   0.0553101     -0.0151759          -0.0459666
latitude            -0.9246644   1.0000000           0.0111727   -0.0360996              NA  -0.1087847  -0.0710354     -0.0798091          -0.1441603
housing_median_age  -0.1081968   0.0111727           1.0000000   -0.3612622              NA  -0.2962442  -0.3029160     -0.1190340           0.1056234
total_rooms          0.0445680  -0.0360996          -0.3612622    1.0000000              NA   0.8571260   0.9184845      0.1980496           0.1341531
total_bedrooms              NA          NA                  NA           NA               1          NA          NA             NA                  NA
population           0.0997732  -0.1087847          -0.2962442    0.8571260              NA   1.0000000   0.9072223      0.0048343          -0.0246497
households           0.0553101  -0.0710354          -0.3029160    0.9184845              NA   0.9072223   1.0000000      0.0130331           0.0658427
median_income       -0.0151759  -0.0798091          -0.1190340    0.1980496              NA   0.0048343   0.0130331      1.0000000           0.6880752
median_house_value  -0.0459666  -0.1441603           0.1056234    0.1341531              NA  -0.0246497   0.0658427      0.6880752           1.0000000
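The NA entries for total_bedrooms most likely arise because that column has 207 missing values (see the data summary), and cor() returns NA by default whenever an input contains a missing value. The sketch below shows the behavior on made-up vectors (the names and values are illustrative assumptions) and one way around it using the use argument of cor().

```r
# Made-up columns; one value is missing, as in total_bedrooms
bedrooms   <- c(1, 2, 3, 4, NA)
households <- c(2, 4, 6, 8, 10)

# Default (use = "everything"): any missing value makes the result NA
r_default <- cor(bedrooms, households)

# Keep only the rows where both values are present
r_complete <- cor(bedrooms, households, use = "complete.obs")
```

For a full correlation matrix, use = "pairwise.complete.obs" is another option; it drops missing rows separately for each pair of columns.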

Correlation Graphs

Spatial Visualization Graph

Interactive Plot

Map Animation

One Hot Encoding Ocean Proximity Correlation Table

Because I want to focus on the relationship between median house value and ocean proximity, I need to transform the data by performing one-hot encoding. One-hot encoding transforms each category value into a new binary (0 or 1) column.

##                              longitude    latitude housing_median_age
## longitude                  1.000000000 -0.92466443        -0.10819681
## latitude                  -0.924664434  1.00000000         0.01117267
## housing_median_age        -0.108196813  0.01117267         1.00000000
## total_rooms                0.044567978 -0.03609960        -0.36126220
## total_bedrooms                      NA          NA                 NA
## population                 0.099773223 -0.10878475        -0.29624424
## households                 0.055310093 -0.07103543        -0.30291601
## median_income             -0.015175865 -0.07980913        -0.11903399
## median_house_value        -0.045966615 -0.14416028         0.10562341
## ocean_proximityINLAND     -0.055574654  0.35116598        -0.23664459
## ocean_proximityISLAND      0.009445503 -0.01657165         0.01701984
## ocean_proximityNEAR.BAY   -0.474488910  0.35877099         0.25517166
## ocean_proximityNEAR.OCEAN  0.045508838 -0.16081792         0.02162156
##                            total_rooms total_bedrooms   population   households
## longitude                  0.044567978             NA  0.099773223  0.055310093
## latitude                  -0.036099596             NA -0.108784747 -0.071035433
## housing_median_age        -0.361262201             NA -0.296244240 -0.302916009
## total_rooms                1.000000000             NA  0.857125973  0.918484493
## total_bedrooms                      NA              1           NA           NA
## population                 0.857125973             NA  1.000000000  0.907222266
## households                 0.918484493             NA  0.907222266  1.000000000
## median_income              0.198049645             NA  0.004834346  0.013033052
## median_house_value         0.134153114             NA -0.024649679  0.065842651
## ocean_proximityINLAND      0.025624325             NA -0.020732123 -0.039402469
## ocean_proximityISLAND     -0.007571767             NA -0.010412114 -0.009077005
## ocean_proximityNEAR.BAY   -0.023022417             NA -0.060880154 -0.010093339
## ocean_proximityNEAR.OCEAN -0.009175150             NA -0.024263727  0.001714434
##                           median_income median_house_value
## longitude                  -0.015175865        -0.04596662
## latitude                   -0.079809127        -0.14416028
## housing_median_age         -0.119033990         0.10562341
## total_rooms                 0.198049645         0.13415311
## total_bedrooms                       NA                 NA
## population                  0.004834346        -0.02464968
## households                  0.013033052         0.06584265
## median_income               1.000000000         0.68807521
## median_house_value          0.688075208         1.00000000
## ocean_proximityINLAND      -0.237495762        -0.48485933
## ocean_proximityISLAND      -0.009228171         0.02341608
## ocean_proximityNEAR.BAY     0.056196803         0.16028448
## ocean_proximityNEAR.OCEAN   0.027343611         0.14186217
##                           ocean_proximityINLAND ocean_proximityISLAND
## longitude                           -0.05557465           0.009445503
## latitude                             0.35116598          -0.016571648
## housing_median_age                  -0.23664459           0.017019840
## total_rooms                          0.02562432          -0.007571767
## total_bedrooms                               NA                    NA
## population                          -0.02073212          -0.010412114
## households                          -0.03940247          -0.009077005
## median_income                       -0.23749576          -0.009228171
## median_house_value                  -0.48485933           0.023416076
## ocean_proximityINLAND                1.00000000          -0.010614425
## ocean_proximityISLAND               -0.01061443           1.000000000
## ocean_proximityNEAR.BAY             -0.24088703          -0.005498984
## ocean_proximityNEAR.OCEAN           -0.26216349          -0.005984684
##                           ocean_proximityNEAR.BAY ocean_proximityNEAR.OCEAN
## longitude                            -0.474488910               0.045508838
## latitude                              0.358770991              -0.160817925
## housing_median_age                    0.255171663               0.021621556
## total_rooms                          -0.023022417              -0.009175150
## total_bedrooms                                 NA                        NA
## population                           -0.060880154              -0.024263727
## households                           -0.010093339               0.001714434
## median_income                         0.056196803               0.027343611
## median_house_value                    0.160284484               0.141862170
## ocean_proximityINLAND                -0.240887033              -0.262163488
## ocean_proximityISLAND                -0.005498984              -0.005984684
## ocean_proximityNEAR.BAY               1.000000000              -0.135818271
## ocean_proximityNEAR.OCEAN            -0.135818271               1.000000000
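One-hot encoding as described in this section can be sketched with base R's model.matrix(). This is an illustration on a made-up data frame, not necessarily the method used to build the table above; the names toy and enc are assumptions. Dropping the intercept column leaves one binary column per category except the first level, which acts as the reference category.

```r
# Made-up rows using three of the ocean_proximity categories
toy <- data.frame(
  ocean_proximity = factor(
    c("<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"),
    levels = c("<1H OCEAN", "INLAND", "NEAR BAY")
  )
)

# model.matrix() expands the factor into 0/1 columns;
# the first column is the intercept, so it is dropped here
mm <- model.matrix(~ ocean_proximity, data = toy)[, -1, drop = FALSE]

# data.frame() makes the column names syntactically valid,
# which turns the space in "NEAR BAY" into a dot
enc <- data.frame(mm)
print(enc)
```

A row with zeros in every encoded column belongs to the reference category, here "<1H OCEAN".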

Linear Regression Analysis

## 
## Call:
## lm(formula = median_house_value ~ ., data = enc_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -556980  -42683  -10497   28765  779052 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.270e+06  8.801e+04 -25.791  < 2e-16 ***
## longitude                 -2.681e+04  1.020e+03 -26.296  < 2e-16 ***
## latitude                  -2.548e+04  1.005e+03 -25.363  < 2e-16 ***
## housing_median_age         1.073e+03  4.389e+01  24.439  < 2e-16 ***
## total_rooms               -6.193e+00  7.915e-01  -7.825 5.32e-15 ***
## total_bedrooms             1.006e+02  6.869e+00  14.640  < 2e-16 ***
## population                -3.797e+01  1.076e+00 -35.282  < 2e-16 ***
## households                 4.962e+01  7.451e+00   6.659 2.83e-11 ***
## median_income              3.926e+04  3.380e+02 116.151  < 2e-16 ***
## ocean_proximityINLAND     -3.928e+04  1.744e+03 -22.522  < 2e-16 ***
## ocean_proximityISLAND      1.529e+05  3.074e+04   4.974 6.62e-07 ***
## ocean_proximityNEAR.BAY   -3.954e+03  1.913e+03  -2.067  0.03879 *  
## ocean_proximityNEAR.OCEAN  4.278e+03  1.570e+03   2.726  0.00642 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68660 on 20420 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6465, Adjusted R-squared:  0.6463 
## F-statistic:  3112 on 12 and 20420 DF,  p-value: < 2.2e-16
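The call shown in the output above, lm() with the formula median_house_value ~ ., regresses house value on every other column of the data. A minimal sketch of the same pattern on a made-up data frame follows; toy_data and its values are assumptions chosen so the fit is easy to check by hand.

```r
# Made-up data: house value rises by roughly 2 units per unit of income
toy_data <- data.frame(
  median_income      = c(1, 2, 3, 4, 5),
  median_house_value = c(3.1, 4.9, 7.2, 8.8, 11.0)
)

# "~ ." means: use all remaining columns as predictors
fit <- lm(median_house_value ~ ., data = toy_data)

# Coefficient table with estimates, standard errors, t values, and p-values
print(summary(fit))

intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["median_income"])
```

With these values the least-squares slope is 1.97 and the intercept is 1.09.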

Linear Regression Confidence Intervals Plot

Linear Regression Graph

The standard error measures the precision of a sample statistic (like the mean or a regression coefficient). It tells you how much the estimate is expected to vary if you repeated the sampling process many times.

  • A small SE means the estimate is more precise.

  • A large SE means the estimate is more variable (less reliable).

  • SE is used to compute confidence intervals and t-values.
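As a small worked case, the standard error of a sample mean is the sample standard deviation divided by the square root of the sample size. The sketch below uses made-up values (x is an assumption) and shows that a larger sample with a similar spread gives a smaller standard error.

```r
# Made-up sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standard error of the mean: sd / sqrt(n)
se_mean <- sd(x) / sqrt(length(x))

# The same values repeated four times: similar spread, four times the size
x_big  <- rep(x, 4)
se_big <- sd(x_big) / sqrt(length(x_big))

print(c(se_small_sample = se_mean, se_large_sample = se_big))
```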

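A t-value (or t-statistic) comes from a t-test, a statistical method used to determine whether an estimate differs significantly from a hypothesized value. The t-value measures how far your sample statistic (like a sample mean or a regression coefficient) is from the null hypothesis value, in units of standard error. In the regression output above, each t value is the coefficient estimate divided by its standard error, under the null hypothesis that the coefficient is zero.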

  • A larger t-value (positive or negative) means your result is further from the null hypothesis.

  • A t-value near 0 means your sample mean is close to the null hypothesis mean.

  • You compare the t-value to a critical value (based on degrees of freedom and chosen confidence level) to decide if the result is statistically significant.
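The relationship between estimate, standard error, t-value, and confidence interval can be checked against the median_income row of the regression output above. The sketch below uses the rounded figures printed there (estimate 3.926e+04, standard error 3.380e+02, 20420 residual degrees of freedom), so the result matches the reported t value only up to rounding.

```r
# Figures copied from the median_income row of the regression summary
estimate  <- 3.926e+04
std_error <- 3.380e+02
df_resid  <- 20420

# t-value: distance from the null value (zero) in units of standard error
t_value <- estimate / std_error

# Critical value for a two-sided test at the 95% confidence level
t_crit <- qt(0.975, df = df_resid)

# 95% confidence interval: estimate plus or minus t_crit standard errors
ci <- estimate + c(-1, 1) * t_crit * std_error

print(c(t_value = t_value, t_crit = t_crit, lower = ci[1], upper = ci[2]))
```

Because the t-value is far larger than the critical value of about 1.96, the coefficient is statistically significant, which agrees with the three stars next to it in the output.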

R-squared (also written as R^2) is a statistical measure that tells you how well your regression model fits the data. Specifically, it represents the proportion of the variance in the dependent variable that is explained by the independent variable(s).
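The definition above corresponds to R-squared = 1 - (residual sum of squares) / (total sum of squares). A minimal sketch on made-up data (toy_data is an assumption) computes it by hand and compares it with the value that summary() reports.

```r
# Made-up data with a strong linear trend
toy_data <- data.frame(
  median_income      = c(1, 2, 3, 4, 5),
  median_house_value = c(3.1, 4.9, 7.2, 8.8, 11.0)
)
fit <- lm(median_house_value ~ median_income, data = toy_data)

y <- toy_data$median_house_value

# Variation left unexplained by the model
ss_res <- sum(residuals(fit)^2)

# Total variation of the dependent variable around its mean
ss_tot <- sum((y - mean(y))^2)

r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit)$r.squared
```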

Summary:

  1. The model fits reasonably well (R² ≈ 0.65).

  2. Most variables are statistically significant.

  3. median_income is the strongest positive predictor.

  4. Location features (longitude, latitude, ocean_proximity) are very important.

  5. Population and housing structure (rooms, households) affect value but may be entangled in multicollinearity (see footnote 1).

Very Important Predictors


Predictor              Estimate  t-value  Observations
median_income          +39,260   +116.2   Strongest positive effect on house value. More income = higher house value.
population             −37.97    −35.3    Larger populations are associated with lower house values.
longitude              −26,810   −26.3    More eastern location (longitude less negative) = lower value, other variables held fixed.
latitude               −25,480   −25.4    More northern location = lower value. (Suggests high-value areas are clustered in southern California.)
housing_median_age     +1,073    +24.4    Older homes tend to be more valuable.
ocean_proximityINLAND  −39,280   −22.5    Inland properties are much cheaper compared to the reference category.

Important Predictors

Predictor              Estimate  t-value  Observations
total_bedrooms         +100.6    +14.6    More bedrooms = higher value, but likely correlated with income or household size.
total_rooms            −6.19     −7.83    Surprisingly negative, may indicate multicollinearity (e.g., with households or bedrooms).
households             +49.6     +6.66    More households = higher median value (urban/suburban effect).
ocean_proximityISLAND  +152,900  +4.97    Island properties are significantly more valuable.

Less Important (still statistically significant)

Predictor                  Estimate  t-value  Observations
ocean_proximityNEAR.OCEAN  +4,278    +2.73    Small positive impact on value.
ocean_proximityNEAR.BAY    −3,954    −2.07    Small negative effect (barely significant).

Next Steps

  • Plot absolute t-values or standardized coefficients.

  • Use stepwise selection, Lasso regression, or random forest to compare and confirm variable impact.

  • Check for multicollinearity (e.g., using VIF scores) to see if some variables are redundant.
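The VIF check mentioned in the last step can be sketched without extra packages: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The helper name vif_manual and the data below are hypothetical; x2 is built to be nearly a copy of x1, while x3 is unrelated to both.

```r
# Made-up predictors: x2 is x1 plus a small wobble, x3 is unrelated
x1 <- 1:10
wobble <- c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1)
predictors <- data.frame(
  x1 = x1,
  x2 = x1 + wobble,
  x3 = c(5, 1, 4, 2, 3, 3, 2, 4, 1, 5)
)

# VIF of each column: regress it on the remaining columns, then 1 / (1 - R^2)
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(predictors)
print(round(vif_values, 2))
```

A VIF near 1 indicates a predictor that shares little information with the others; very large values, as for x1 and x2 here, point to redundant predictors. Applied to this report, the same check could be run on total_rooms, total_bedrooms, population, and households.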


  1. Multicollinearity happens when two or more predictor variables in a regression model are highly correlated with each other. This means they contain overlapping information, which makes it hard for the model to determine which variable is actually influencing the outcome.